Abstract

Background: cDNA libraries are widely used to identify genes and splice variants, and as a physical 
resource for full-length clones. Conventionally-generated cDNA libraries contain a high percentage
of 5'-truncated clones. Current library construction methods that enrich for full-length mRNA are 
laborious, and involve several enzymatic steps performed on mRNA, which renders them sensitive 
to RNA degradation. The SMART technique for full-length enrichment is robust but results in 
limited cDNA insert size of the library. 

Results: We describe a method to construct SMART full-length enriched cDNA libraries with 
large insert sizes. Sub-libraries were generated from size-fractionated cDNA with an average insert 
size of up to seven kb. The percentage of full-length clones was calculated for different size ranges 
from BLAST results of over 12,000 5'ESTs.

Conclusions: The presented technique is suitable to generate full-length enriched cDNA libraries 
with large average insert sizes in a straightforward and robust way. The representation of full- 
coding clones is high also for large cDNAs (70%, 4-10 kb), when high-quality starting mRNA is 
used. 



Background 

Full-length cDNA clones are indispensable tools for
functional genomics [1]. cDNA libraries are widely used to
identify genes and splice variants and as a physical 
resource for full-length clones. Unfortunately, cDNA 
libraries constructed according to conventional methods 
[2] contain a high percentage of 5' truncated clones due to 
the premature stop of reverse transcription (RT) of the 
template mRNA. This is especially true for large mRNAs 
and those tending to form secondary structures. In addi- 
tion, there is a size bias against large fragments inherent in 
the cloning procedure. For these reasons, large full-length
cDNAs are strongly underrepresented in conventional
libraries. Several methods have been developed to con- 
struct cDNA libraries that are enriched for full-length 
cDNAs. Most are based on either RNA oligo ligation to the 
5' end of mRNA [3,4], 5' cap affinity selection via eukary- 
otic initiation factor 4E [5], or 5' cap biotinylation fol- 
lowed by biotin affinity selection [6,7]. Common to these 
methods is that they are laborious and contain several 
enzymatic steps that must be performed on mRNA. There- 
fore, they are sensitive to quality loss through RNA degra- 
dation. Furthermore, they require high amounts of 
starting mRNA (5-100 µg depending on method).

In contrast, using SMART technology for full-length
enrichment of cDNA is very straightforward and robust
and requires only 0.025-1 µg of starting mRNA [8]. This
technology utilizes the property of some MMLV reverse 
transcriptases to add a few C residues at the 3' end of the 
first strand cDNA when they reach the end of the mRNA 
template, but not at prematurely terminated reverse transcripts.
An RNA oligo ending in three G residues, which is
present in the reaction, base-pairs with the
added Cs and serves as a prolonged template for reverse
transcription. By these means, full-length cDNAs but not 
prematurely terminated ones are 5'-tagged and can be
amplified by an RNA oligo-specific primer (Figure 1, step
1 and 2). The percentage of full-length clones in libraries
constructed with the SMART technique is much higher
compared to conventional libraries [8] and, when tran- 
scripts up to 3 kb are compared, better than libraries con- 
structed with other full-length enriching techniques [9]. 
However, large clones are rarely found in SMART libraries 
as well as in libraries constructed according to the other 
full-length enriching techniques [8,9], unless specialized 
lambda vectors are used [10]. We modified the proven 
and robust SMART technique to construct cDNA libraries 
with large average insert sizes in convenient plasmid vec- 
tors. We here report the construction of size fraction sub- 
libraries enriched for full-length clones having an average 
insert size of up to 7 kb and the analysis of full-length per- 
centage for these libraries. 

Figure 1

Library construction process. Schematic representation of the library construction process. First strand cDNA is synthesized with the SMART system (1), second strand synthesis is primed by a SMART oligo-specific primer (2), double-stranded cDNA is size-fractionated via agarose gel electrophoresis (3), and size fractions are amplified and cloned separately (4).

Results and Discussion

Generation of size-fractionated full-length enriched cDNA

cDNA synthesis according to the SMART protocol is as 
convenient as conventional first strand synthesis. There is 
no need for any mRNA manipulative step prior to the 
reaction and the only difference is the presence of the 
SMART RNA oligo in the reaction (figure 1, step 1 and 2). 
Synthesis must be done with an MMLV RTase that is
RNaseH negative to ensure addition of C residues, and to 
prevent SMART RNA oligo degradation during base pair- 
ing with these residues. The full-length selective step is the 
PCR amplification following cDNA synthesis; therefore,
there is a bias against large cDNAs, as smaller cDNAs are 
preferentially amplified during PCR (personal observa- 
tion). In our strategy, cDNA is size-fractionated prior to 
this PCR amplification step and PCR is performed with 
the different size fractions in separate reactions (figure 1, 
step 3 and 4). Because large cDNAs are less frequent, more
PCR cycles must be done on the large fractions compared 
to smaller fractions to obtain an equivalent amount of 
PCR product for cloning. But, to avoid increasing redun- 
dancy and to reduce errors introduced by PCR polymer- 
ase, as few PCR cycles as possible must be performed. We 
typically did 12 to 16 cycles, depending on size fraction. 
By these means, large cDNAs could be amplified as effi- 
ciently as smaller ones. These large cDNAs are strongly 
underrepresented in control amplification products of 
unfractionated cDNA (figure 2, panel A). 

Polymerase error rate is a major concern in PCR-based 
library construction techniques. Therefore, it is crucial to 
perform as few PCR cycles as possible, as each duplication 
increases the number of introduced errors by a factor of 
two, assuming a constant error rate of the used polymer- 
ase. The Expand™ PCR System we used was tested to have 
an error rate of 8.5 x 10^-6 [11]. Starting with PolyA+ RNA
we could restrict the number of cycles to 12 to 16. Levesque
et al., who also combined SMART cDNA amplification
with size fractionation, started with total RNA and did
45 to 47 cycles in total. In contrast to our approach, where 
amplification follows size fractionation, they did 33 cycles 
before and 12-14 after fractionation. In their study, the 
obtained sub-libraries were not analysed for insert size 
range; instead, they screened them with three gene-specific
probes [12].
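
The effect of cycle number can be illustrated with a rough calculation. The sketch below is not part of the published analysis. It assumes, as a simplification, that a product molecule is copied at most once per cycle, so that the expected number of errors per molecule is bounded by error rate x insert length x cycles, and that errors follow a Poisson distribution. The function names and the 7 kb example insert are illustrative only.

```python
import math

# Errors per base per duplication, the value quoted in the text for the Expand system [11].
ERROR_RATE = 8.5e-6


def expected_errors(insert_bp, cycles, error_rate=ERROR_RATE):
    """Upper-bound estimate of polymerase errors per amplified molecule.

    Assumption: each molecule results from at most `cycles` copying events.
    """
    return error_rate * insert_bp * cycles


def error_free_fraction(insert_bp, cycles, error_rate=ERROR_RATE):
    """Poisson estimate of the fraction of molecules carrying no error."""
    return math.exp(-expected_errors(insert_bp, cycles, error_rate))


if __name__ == "__main__":
    # Compare the cycle numbers discussed in the text for a 7 kb insert.
    for cycles in (12, 16, 47):
        print(cycles,
              round(expected_errors(7000, cycles), 2),
              round(error_free_fraction(7000, cycles), 2))
```

Under these assumptions the estimate grows linearly with both insert length and cycle number, which is why the large size fractions benefit most from keeping the cycle number low.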

Insert size of libraries

In conventionally-constructed libraries, large insert clones
are rarely found. This is because very long transcripts often 
get truncated during cDNA synthesis, and because there is 
a strong size bias against large fragments inherent in the 
cloning procedure, i.e. ligation and bacterial transformation.
In our strategy, PCR-amplified cDNA size fractions
are restriction digested and separately cloned into a plasmid
vector to obtain size fraction sub-libraries. To analyse
the range of insert sizes within these sub-libraries, clone
pools of 5,000-10,000 clones were grown in semi-solid
agar and plasmid restriction digests of the clone pools
were performed. Each sub-library almost exclusively contains
inserts within the size range of the corresponding
cDNA size fraction that was cloned to produce this sub-library
(figure 2, panels A and B). In sub-library 1 for
example, most inserts are between 6 and 8 kb. Such inserts
are rarely found in conventional libraries.

Figure 2

Quality control of cDNA and sub-libraries. Panel A: Amplified cDNA size fractions. cDNA was size-fractionated and separate size fractions (1-4) were PCR-amplified. U = unfractionated control, size marker in kb. Panel B: Insert analysis of sub-libraries obtained by cloning of the amplified size fractions shown in panel A. Plasmid DNA of clone pools of 5,000-10,000 clones was restriction digested by SfiI. An arrowhead indicates the vector band. The smear corresponds to the insert size range of the sub-library.

Full-length content 

The full-length enriched cDNA sub-libraries generated
according to the protocol described here serve as clone 
resource for the cDNA sequencing efforts of the German 
cDNA Consortium http://www.dkfz-heidelberg.de/mga/groups.asp?siteID=48 [13]. Within this project, over
100,000 5'ESTs have been generated. All sequences are 
submitted to public databases and clones are available 
through the German Resource Center for Genome 
Research http://www.rzpd.de. To determine the full-length
cDNA content of our libraries, 5'ESTs were blasted
against human RefSeq sequences according to parameters
specified in the Methods section. The total number of hits
to known mRNAs was set as 100% and the percentage of
clones containing the 5' end of the hit was calculated.
Accordingly, full-ORF content was determined by BLAST 
analysis against the SWISSPROT database. 

Figure 3 shows full-cDNA and full-ORF percentages of 
three size fraction sub-libraries made from an 
endometrium carcinoma cell line. With 3827 5'ESTs 
blasted in total for these three sub-libraries, 1439 hits to 
known mRNAs were found in human RefSeq and 513 hits 
to known proteins were found in SWISSPROT. For calculation
of full-cDNA/full-ORF percentage, only hits within
the corresponding size range of the sub-library were taken
into consideration. Full-length percentages range from 
46% to 59%, and full-ORF percentages from 63% to 76%, 
depending on sub-library. Full-length content does not 
decrease significantly with increasing cDNA/ORF size, as
is observed in conventional cDNA libraries. In the sub-library
containing 4-10 kb inserts, the percentage of full-coding
clones is still almost 70%, which is extremely high
for this size range (figure 3).



Unlike other full-length enriching protocols, there is no 
negative selection against truncated mRNA molecules in 
the SMART technique, because the basic principle is selec- 
tion for full-reverse transcribed mRNA molecules rather 
than mRNA cap selection. Therefore, mRNA quality is crucial.
The starting mRNA of the library whose full-length
analysis is shown in figure 3 was of the highest quality, i.e. in
a control agarose gel the mRNA smear extended up to 10 kb. Figure 4
shows the analysis of sequence data from over ten different
libraries made from mRNAs of various qualities. With
50,023 5'ESTs blasted in total, 12,208 hits to known pro- 
teins were found in the SWISSPROT database. Size win- 
dows shown correspond to the size of the SWISSPROT 
hits. For better orientation, the calculated corresponding 
mRNA/cDNA size is also shown. Here, full-ORF content 
decreases with increasing mRNA size, as can be expected 
due to the fact that for large transcripts there is a higher 
probability of truncated molecules in the starting mRNA. 
For transcripts up to 3 kb, 60-70% contain the complete
ORF. This number gradually decreases to 30% for 5-6 kb 
and 20% for 7 kb. Although these numbers are still rea- 
sonably high bearing transcript size in mind, they are 
lower than in the library shown in figure 3. Probably, this 
is due to lower quality of starting mRNAs. 



[Figure 3: bar chart. Y-axis: percentage (10-80); X-axis: insert size of sub-library in kbp (2-3, 3-4, 4-10); series: % full-cDNA and % full-ORF. Bar labels legible in the source: 248/421, 90/130, 66/104, 170/369.]

Figure 3

Completeness of cDNA and coding region in different size fraction sub-libraries of a cDNA library. Sequences from an endometrium carcinoma cell line library were analysed. Completeness of cDNA/ORF was calculated by BLAST analysis against the human RefSeq ("full-length") and SWISSPROT ("full-ORF") databases, respectively, with parameters specified in the Methods section. 3827 5'ESTs were blasted, 1439 hits were found in human RefSeq, and 513 hits in SWISSPROT. Percentages of full-length/full-ORF hits were calculated.

Figure 4

Completeness of coding region in different size ranges. Completeness of ORF was calculated by BLAST against the SWISSPROT database, with parameters specified in the Methods section. The lower X-axis indicates the length of the SWISSPROT hit in amino acids. The upper X-axis indicates calculated corresponding cDNA lengths, assuming an average 3' UTR length of 525 bp [21].
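
The conversion behind the upper X-axis can be sketched as follows. Only the 525 bp average 3' UTR length comes from the caption. Counting one stop codon and offering an optional 5' UTR term (default 0) are assumptions made for illustration, and the function name is hypothetical.

```python
AVG_3UTR_BP = 525  # average 3' UTR length assumed in the figure caption [21]


def estimated_cdna_length(protein_aa, utr3_bp=AVG_3UTR_BP, utr5_bp=0):
    """Estimate cDNA length in bp from the length of a protein hit in amino acids.

    Assumption: three bases per amino acid plus one stop codon, plus the UTRs.
    """
    if protein_aa < 0:
        raise ValueError("protein length must be non-negative")
    coding_bp = 3 * (protein_aa + 1)
    return utr5_bp + coding_bp + utr3_bp
```

For example, `estimated_cdna_length(1000)` gives the estimate for a 1,000 amino acid hit; passing a non-zero `utr5_bp` shifts all estimates by that constant.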



cDNA size fractionation has been used previously in two 
studies to enrich cDNA libraries for full-length clones 
[12,14]. In both studies, the sub-libraries were not
analysed for insert sizes. In consequence, it remains
unclear whether the sub-libraries actually contained the
expected range of insert sizes. Levesque et al. [12] also
combined the SMART technique with cDNA size fractionation,
but did not analyse the overall full-length content;
instead, they screened the libraries with three gene-specific
probes. Draper et al. [14] calculated the percentage of
full-coding clones in size fractionated libraries from 
BLAST results of 78 hits in total and down to 3 hits per size 
range. We calculated the percentage of full-coding clones 
in the libraries generated according to the presented 
method from BLAST results of over 12,000 hits in total 
and between 99 and 3363 hits per size range. The high 
number of hits for a given size range permits a much more
reliable calculation of full-length percentages compared
to former studies. Furthermore, because of the large insert
size of our sub-libraries, large size ranges can be analysed
(up to 10 kb), which had not been analysed before in similar
studies [8,9,14].

Conclusions 

The method presented is attractive for the construction of 
full-length enriched cDNA libraries with large average 
insert sizes for several reasons. First, there is no additional 
enzymatic step for the enrichment, which saves time. Second,
it is easy to use, because it avoids the enzymatic steps performed
on mRNA that are necessary in other full-length enriching
techniques and that are extremely critical in terms of mRNA
degradation and quantity loss. Third, the cDNA sizing protocol
presented is very efficient and can be performed with basic 
laboratory equipment. cDNA libraries constructed accord- 
ing to the method presented also yield high full-length 
percentages for large cDNAs/ORFs when high quality 
starting mRNA is used. 

Methods

cDNA synthesis

First strand cDNA was synthesized from 1 µg of mRNA
with the "SMART cDNA Library Construction Kit" (Clontech)
in a 10 µl reaction according to the manufacturer's
protocol. In this reaction, a fraction of full-transcribed 
first strand cDNA molecules but not truncated cDNAs is 
tagged with a short sequence complementary to the 
SMART oligo. The SMART oligo sequence
(AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGG) and the
overhang of the oligo(dT) primer
(ATTCTAGAGGCCGAGGCGGCCGACATG[dT]30VN)
used for first strand synthesis both include an SfiI restriction
site. After first strand synthesis, 40 pmol of 5' PCR
primer (corresponding to the SMART oligo sequence) was
added and first strand cDNA was denatured for 5 min at
95 °C. The reaction was cooled to 60 °C and second strand
reaction mix was added to give a final concentration of 1x
PCR reaction buffer (Expand 20 kb PLUS PCR System,
Roche), 0.5 mM dNTPs, and 8.3 U/µl Expand 20 kb PLUS
enzyme mix (Roche) in a volume of 60 µl. This second
strand reaction mixture was incubated for 3 cycles of 15 
min 60 °C and 15 min 68 °C. The second strand reaction 
was phenol-extracted and cDNA was precipitated from the 
aqueous phase with 1/2 volume 7.5 M ammonium acetate
and 2.5 volumes of 100% ethanol. The washed pellet was
dried and suspended in 10 µl of water. As a quality control,
1 µl was electrophoresed on a 1% agarose gel.
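
The SfiI sites in the two oligonucleotides can be located with a short script. This is an illustrative sketch rather than part of the published protocol: the recognition pattern GGCCNNNNNGGCC is the standard SfiI site, the sequences are those of the SMART oligo and the oligo(dT) primer overhang named in this section (written without spaces), and the function name is hypothetical.

```python
import re

# SfiI recognition site: GGCC, any five bases, GGCC.
# A lookahead is used so that overlapping sites would also be reported.
SFII_SITE = re.compile(r"(?=GGCC([ACGT]{5})GGCC)")

SMART_OLIGO = "AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGG"
OLIGO_DT_OVERHANG = "ATTCTAGAGGCCGAGGCGGCCGACATG"


def find_sfii_sites(seq):
    """Return a list of (0-based start, five-base spacer) for each SfiI site."""
    seq = seq.upper()
    return [(m.start(), m.group(1)) for m in SFII_SITE.finditer(seq)]


if __name__ == "__main__":
    for name, seq in (("SMART oligo", SMART_OLIGO),
                      ("oligo(dT) overhang", OLIGO_DT_OVERHANG)):
        print(name, find_sfii_sites(seq))
```

The two sites differ in their five-base spacer, which would give the two cDNA ends different SfiI overhangs and is consistent with directional cloning into an SfiI-digested vector.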

Size fractionation of cDNA

Double-stranded cDNA alongside a DNA ladder was
subjected to size fractionation on two consecutive agarose
gels: cDNA was separated on a 0.7% agarose gel and the
size fraction 1.5-20 kb excised from the cDNA lane and 
from the DNA ladder lane. The gel slices were then rotated 
by 90° and placed in a gel tray. A 1.4% low-melting agarose
gel was cast around the slices. Electrophoresis was
performed overnight at 37 V and the DNA ladder "lane"
was stained and photographed with a ruler. 3-6 size frac- 
tions were excised from the unstained cDNA "lane" 
according to the DNA ladder "lane" (figure 5). cDNA was 
extracted from the gel slices with agarase (gelase, Epicentre)
according to the manufacturer's instructions. After
gelase digestion, the reaction was phenol extracted, the
aqueous phase incubated on ice for 15 min, and centrifuged
at 4 °C at maximum speed for 15 min. cDNA was
precipitated from the supernatant with 1/2 volume 7.5 M
ammonium acetate, 1 µl PelletPaint (Novagen), and 2.5
volumes of 100% ethanol. Washed pellets were dried and
suspended in 10 µl of water.

Figure 5

cDNA size fractionation. Alongside the cDNA, a DNA ladder is size-fractionated and stained. The ladder is photographed with a ruler and cutting edges are marked (thin dotted lines). The unstained cDNA is cut from the gel accordingly.

PCR amplification of cDNA size fractions

One µl of each cDNA fraction was amplified in a 10 µl
reaction containing a final concentration of 1x PCR reaction
buffer (Expand 20 kb PLUS PCR System, Roche), 0.5
mM dNTPs, 0.5 pmol/µl forward primer
(AAGCAGTGGTATCAACGCAGAGT), 0.5 pmol/µl reverse primer
(ATTCTAGAGGCCGAGGCGGCCGACATG), and 8.3 U/µl
Expand 20 kb PLUS enzyme mix (Roche). To perform
manual hot start, the reactions were prepared in two master
mixes, one containing buffer and enzyme, the other
containing dNTPs, primer, and cDNA. The two master
mixes were combined at 92 °C. After initial denaturation
at 92 °C for 3 min, 12-16 cycles (depending on size fraction
and second strand cDNA quality and intensity) of
92 °C for 10 sec and 68 °C for 14 min were performed. PCR
products were analysed on agarose gel and PCR was repeated
in a 50 µl volume with 5 µl cDNA and fine-tuned cycle
number (i.e. reduced for intensive products and increased
for weak signals). Five µl of the 50 µl reaction were analysed
on an agarose gel. The remaining reaction was proteinase K digested,
phenol extracted, and precipitated.

Cloning and quality control of sub-libraries 

The precipitated amplified cDNA was SfiI-digested in a 40
µl volume. The SfiI digest was gel-purified using low-melting
agarose and gelase (Epicentre). DNA was suspended
in 10 µl water and concentration was determined using
the PicoGreen reagent (Molecular Probes). 20 fmol of
cDNA was ligated to 10 fmol SfiI-digested pSPORT1_Sfi
vector (a modified pSPORT vector having the part of the
MCS between KpnI and HindIII exchanged by the corresponding
part of the pTriplEx2 MCS, so that it contains
SfiI sites). For quality control, 5,000-10,000 clones were
grown in semi-solid agar (SeaPrep agarose, BMA) and centrifuged;
plasmid DNA was extracted from these clone pools,
SfiI-digested, and analysed on an agarose gel. If the quality
was satisfactory, 96 single clones were picked and
insert analysis was performed as with the clone pools.
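
Because the ligation is set up in molar amounts while DNA is usually quantified by mass, a conversion is needed for each size fraction. The sketch below assumes an average molar mass of about 650 g/mol per base pair of double-stranded DNA, a common approximation that is not stated in the text; the function name and the example insert sizes are illustrative.

```python
AVG_BP_MASS = 650.0  # g/mol per base pair of dsDNA (approximation, an assumption)


def fmol_to_ng(fmol, length_bp, bp_mass=AVG_BP_MASS):
    """Mass in ng of `fmol` femtomoles of double-stranded DNA of `length_bp` bp.

    fmol * (g/mol) gives femtograms; 1 fg = 1e-6 ng.
    """
    return fmol * length_bp * bp_mass * 1e-6


if __name__ == "__main__":
    # 20 fmol of cDNA, as used for ligation, at two example average insert sizes.
    for size_bp in (2500, 7000):
        print(size_bp, round(fmol_to_ng(20, size_bp), 1), "ng")
```

The required mass scales linearly with the average insert size, so the large size fractions need proportionally more DNA for the same molar insert-to-vector ratio.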

Examination of full-length clone content 

Libraries were arrayed in 384-well plates and clones were 
randomly sequenced from the 5' end. 5'ESTs longer than 
150 bp were compared to public databases using the 
BLAST algorithm [15,16] within the Heidelberg Unix 
Sequence Analysis Resources (HUSAR;
http://genome.dkfz-heidelberg.de/) [17].

5'ESTs were compared to a human subset of RefSeq [18]
by BLAST (default parameters, except a wordsize of 20 bp 
was used) to calculate the percentage of full-length cDNA 
clones. The BLAST outputs were further analysed with the 
following criteria to find the maximum scoring RefSeq 
entry: Minimum HSP length of 50 bp, start of HSP within 
the first 100 bp of 5'EST, end of HSP within the last 15% 
of 5'EST length, sequence identity within HSP at least 
95%. If several HSPs within the same hit fit these criteria,
the more upstream match was chosen. A clone was
defined as "full length" when the 5' end of the 5'EST was
upstream or up to 50 bp downstream of the start of the
corresponding RefSeq entry. This last criterion was chosen
to take into account the fact that the transcription start site is
variable for most genes [19], or even unknown.
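
The criteria above can be expressed as a small filter. This is a minimal sketch under stated assumptions, not the script used in the study: HSP length is measured on the 5'EST, "more upstream" is taken to mean the smallest start coordinate on the RefSeq entry, the 5' end of the EST is projected onto the RefSeq entry without accounting for gaps, and all names (`Hsp`, `passes_filters`, `is_full_length`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    query_start: int    # 1-based start of the HSP on the 5'EST
    query_end: int      # 1-based end of the HSP on the 5'EST
    subject_start: int  # 1-based start of the HSP on the RefSeq entry
    identity: float     # percent identity within the HSP


def passes_filters(hsp, est_length):
    """Apply the four HSP criteria listed in the text."""
    hsp_length = hsp.query_end - hsp.query_start + 1
    return (hsp_length >= 50                         # minimum HSP length of 50 bp
            and hsp.query_start <= 100               # starts within first 100 bp of the EST
            and hsp.query_end >= 0.85 * est_length   # ends within last 15% of the EST
            and hsp.identity >= 95.0)                # at least 95% identity


def is_full_length(hsps, est_length, tolerance_bp=50):
    """Classify a clone from the HSPs of its maximum scoring RefSeq hit.

    Returns True or False, or None if no HSP passes the filters.
    """
    candidates = [h for h in hsps if passes_filters(h, est_length)]
    if not candidates:
        return None
    best = min(candidates, key=lambda h: h.subject_start)  # most upstream match
    # Offset of EST base 1 relative to RefSeq base 1 (0 or negative: at or upstream).
    offset = best.subject_start - best.query_start
    return offset <= tolerance_bp
```

A clone whose only qualifying HSP starts at EST position 1 and RefSeq position 30 would be counted as full length under this reading, because its 5' end lies 29 bp downstream of the RefSeq start.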

To calculate the percentage of full-ORF clones, a BLAST
search of the 5'ESTs against the SWISSPROT database [20] 
was performed with default parameters. HSPs with a 
length less than 20 amino acids and sequence identity 
below 75% were filtered out. A clone was counted as
full-ORF when the most upstream HSP of the maximum
scoring hit contained the first amino acid of the SWISS- 
PROT entry. 
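
The full-ORF rule and the resulting percentage can be sketched in the same way. The code below is illustrative only. It assumes that an HSP is discarded if it is shorter than 20 amino acids or has less than 75% identity (the text could also be read as requiring both conditions), that HSP length is measured on the SWISSPROT entry, and that clones without a usable hit are left out of the percentage; all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProteinHsp:
    subject_start: int  # 1-based start on the SWISSPROT entry (amino acids)
    subject_end: int    # 1-based end on the SWISSPROT entry (amino acids)
    identity: float     # percent identity within the HSP


def is_full_orf(hsps, min_length_aa=20, min_identity=75.0):
    """Classify a clone from the HSPs of its maximum scoring SWISSPROT hit.

    Returns True or False, or None if no HSP survives filtering.
    """
    kept = [h for h in hsps
            if (h.subject_end - h.subject_start + 1) >= min_length_aa
            and h.identity >= min_identity]
    if not kept:
        return None
    most_upstream = min(kept, key=lambda h: h.subject_start)
    # Full-ORF: the most upstream HSP contains the first amino acid of the entry.
    return most_upstream.subject_start == 1


def percent_full(flags):
    """Percentage of True among classified clones; None entries are ignored."""
    scored = [f for f in flags if f is not None]
    if not scored:
        return 0.0
    return 100.0 * sum(scored) / len(scored)
```

Applying `is_full_orf` to every clone of a size window and passing the results to `percent_full` yields one percentage per window, in the manner of the values plotted per size range.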

SMART = Switching Mechanism At 5' end of RNA

ORF = open reading frame

RT = reverse transcription